
[https://nvbugs/5859886][fix] Skip DeepEP when NVLink symmetric memory init fails #13172

Open

ziyixiong-nv wants to merge 2 commits into NVIDIA:main from ziyixiong-nv:repair-bot-bug5859886

Conversation

@ziyixiong-nv
Collaborator

@ziyixiong-nv ziyixiong-nv commented Apr 18, 2026

Summary

  • Fix for NVBugs 5859886: [TensorRT-LLM][L0][Post-Merge][main]accuracy/test_llm_api_pytorch.py::TestDeepSeekV32::test_fp8_blockscale[disable_skip_indexer] timeout
  • Root cause: Skip DeepEP when NVLink symmetric memory init fails
  • Fix: (auto-detected from git commit)
  • Automated fix generated by repair-bot

Test plan

  • Verify fix on the same GPU type as the original failure
  • Check for regressions in related tests

Links

Summary by CodeRabbit

  • Bug Fixes

    • Improved NVLink workspace initialization failure handling with graceful fallback mechanism to prevent repeated initialization attempts on known failures.
  • Tests

    • Removed skip waiver for a DeepSeek test case.

@ziyixiong-nv ziyixiong-nv requested a review from a team as a code owner April 18, 2026 00:26
@ziyixiong-nv ziyixiong-nv requested a review from yuxianq April 18, 2026 00:26
@coderabbitai
Contributor

coderabbitai Bot commented Apr 18, 2026

📝 Walkthrough

The changes add failure tracking and error handling for NVLink workspace initialization in the MOE communication module. A new class-level flag records initialization failures, and the strategy factory conditionally skips DeepEP strategies when this flag is set. A test waiver entry is removed.

Changes

MOE Communication Error Handling
Files: tensorrt_llm/_torch/modules/fused_moe/communication/nvlink_one_sided.py, tensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.py
Added a class-level flag _WORKSPACE_INIT_FAILED to track NVLink initialization failures. nvlink_one_sided.py now wraps workspace allocation in try/except to catch RuntimeError and AssertionError, setting the flag and re-raising on failure. communication_factory.py conditionally bypasses DeepEP/DeepEPLowLatency strategy selection when the flag is set, falling back to AllGatherReduceScatter.

Test Waivers
Files: tests/integration/test_lists/waives.txt
Removed the waiver entry for accuracy/test_llm_api_pytorch.py::TestDeepSeekV32::test_fp8_blockscale[disable_skip_indexer].
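The failure-caching pattern summarized above can be sketched as follows. This is a minimal illustration, not the actual TensorRT-LLM code: the method and helper names (`_init_workspace`, `_allocate_symmetric_memory`) are assumptions standing in for the real `MnnvlMemory.initialize()` / `moe_a2a_initialize()` calls.

```python
# Sketch of the class-level failure flag described in the walkthrough.
# Names are illustrative; the real classes live under
# tensorrt_llm/_torch/modules/fused_moe/communication/.
from typing import ClassVar, Optional


class NVLinkOneSided:
    # Class-level state shared by every MoE layer in the process, so one
    # failed initialization is remembered across all subsequent layers.
    _WORKSPACE: ClassVar[Optional[object]] = None
    _WORKSPACE_INIT_FAILED: ClassVar[bool] = False

    def _init_workspace(self) -> None:
        if NVLinkOneSided._WORKSPACE_INIT_FAILED:
            # A previous attempt already failed on this node: fail fast
            # instead of retrying an allocation known to error out or hang.
            raise RuntimeError("NVLink workspace init previously failed")
        if NVLinkOneSided._WORKSPACE is not None:
            return  # workspace already initialized, nothing to do
        try:
            NVLinkOneSided._WORKSPACE = self._allocate_symmetric_memory()
        except (RuntimeError, AssertionError):
            # Record the failure so later layers (and the strategy factory)
            # can skip NVLink/DeepEP paths, then re-raise to the caller.
            NVLinkOneSided._WORKSPACE_INIT_FAILED = True
            NVLinkOneSided._WORKSPACE = None
            raise

    def _allocate_symmetric_memory(self) -> object:
        # Hypothetical stand-in for MnnvlMemory.initialize() plus
        # moe_a2a_initialize(); simulates an unsupported node here.
        raise RuntimeError("no NVLink symmetric memory on this node")
```

Because the flag is class-level rather than per-instance, the expensive (and possibly hanging) allocation is attempted at most once per process.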

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 inconclusive

❌ Failed checks (1 inconclusive)

  • Description check — ❓ Inconclusive. The PR description includes a summary, test plan, and links, but lacks a detailed explanation of what changes were made and why. Resolution: expand the description to explain the root cause, the implementation approach, and how the fix resolves the timeout issue in more detail.

✅ Passed checks (2 passed)

  • Docstring Coverage — ✅ Passed. Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.
  • Title check — ✅ Passed. The title clearly and specifically summarizes the main change: skipping DeepEP when NVLink symmetric memory initialization fails, which directly addresses the root cause of the timeout issue.


Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
tensorrt_llm/_torch/modules/fused_moe/communication/nvlink_one_sided.py (1)

230-237: Consider checking _WORKSPACE_INIT_FAILED before MnnvlMemory.initialize().

The failure check (line 232) occurs after MnnvlMemory.initialize() (line 230). If MnnvlMemory.initialize() is safe to call repeatedly regardless of workspace state, this is fine. However, moving the check earlier would be slightly more defensive.

♻️ Optional: Move failure check before MnnvlMemory.initialize()
+        if self._WORKSPACE_INIT_FAILED:
+            raise RuntimeError(
+                "NVLinkOneSided: workspace initialization (MNNVL/NVSHMEM) previously "
+                "failed on this node, skipping repeated initialization attempt."
+            )
+
         # Initialize or reuse workspace
         MnnvlMemory.initialize()
 
-        if self._WORKSPACE_INIT_FAILED:
-            raise RuntimeError(
-                "NVLinkOneSided: workspace initialization (MNNVL/NVSHMEM) previously "
-                "failed on this node, skipping repeated initialization attempt."
-            )
-
         if self._WORKSPACE is None:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/modules/fused_moe/communication/nvlink_one_sided.py`
around lines 230 - 237, The _WORKSPACE_INIT_FAILED flag should be checked before
attempting to re-initialize workspace resources; move the check for
self._WORKSPACE_INIT_FAILED to immediately before the call to
MnnvlMemory.initialize() in NVLinkOneSided so that if the workspace init
previously failed we raise the RuntimeError early and skip calling
MnnvlMemory.initialize(); keep the same RuntimeError message and behavior
otherwise.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: ef44833d-2b45-4f1e-8224-51e047ee2be3

📥 Commits

Reviewing files that changed from the base of the PR and between 813d877 and 8f778a3.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.py
  • tensorrt_llm/_torch/modules/fused_moe/communication/nvlink_one_sided.py
  • tests/integration/test_lists/waives.txt
💤 Files with no reviewable changes (1)
  • tests/integration/test_lists/waives.txt

@ziyixiong-nv ziyixiong-nv changed the title from "[https://nvbugs/5859886][fix] [TensorRT-LLM][L0][Post-Merge][main]accuracy/test_…" to "[https://nvbugs/5859886][fix] Skip DeepEP when NVLink symmetric memory init fails" on Apr 18, 2026
…ails

When NVLinkOneSided workspace initialization fails (MNNVL allocation or
NVSHMEM moe_a2a_initialize), cache the failure and propagate it to skip
DeepEP/DeepEPLowLatency strategies in CommunicationFactory. DeepEP also
relies on NVSHMEM internally and would hang during forward pass if the
NVLink symmetric memory infrastructure is unavailable.

Changes:
- NVLinkOneSided: wrap workspace init (MnnvlMemory + moe_a2a_initialize)
  in try-except, set _WORKSPACE_INIT_FAILED on failure to avoid repeated
  attempts across MoE layers and signal the factory.
- CommunicationFactory: check _WORKSPACE_INIT_FAILED before trying
  DeepEP/DeepEPLowLatency; fall through to AllGatherReduceScatter (NCCL).
- Remove test waiver for test_fp8_blockscale[disable_skip_indexer].

Signed-off-by: Ziyi Xiong <219238287+ziyixiong-nv@users.noreply.github.com>
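The factory-side fallback described in the commit message above can be sketched as follows. The enum and function names here are illustrative assumptions drawn from the PR summary, not the actual CommunicationFactory API.

```python
# Sketch of the strategy-selection fallback: when NVLink symmetric memory
# init has failed, DeepEP variants (which also rely on NVSHMEM internally
# and would hang in the forward pass) are skipped in favor of the
# NCCL-based AllGatherReduceScatter path.
from enum import Enum, auto


class Strategy(Enum):
    DEEP_EP = auto()
    DEEP_EP_LOW_LATENCY = auto()
    ALL_GATHER_REDUCE_SCATTER = auto()


def select_strategy(preferred: Strategy, workspace_init_failed: bool) -> Strategy:
    """Return the communication strategy to use for a MoE layer.

    If the NVLink symmetric memory infrastructure is known to be broken
    (workspace_init_failed is True), DeepEP/DeepEPLowLatency are bypassed
    and the factory falls through to AllGatherReduceScatter.
    """
    deep_ep_variants = {Strategy.DEEP_EP, Strategy.DEEP_EP_LOW_LATENCY}
    if workspace_init_failed and preferred in deep_ep_variants:
        return Strategy.ALL_GATHER_REDUCE_SCATTER
    return preferred
```

The key design point is that the check happens at strategy-selection time, before any DeepEP object is constructed, so a node without working symmetric memory never enters the code path that would hang.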
@ziyixiong-nv ziyixiong-nv force-pushed the repair-bot-bug5859886 branch from 8f778a3 to ef935a1 Compare April 20, 2026 02:22
@ziyixiong-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #44256 [ run ] triggered by Bot. Commit: ef935a1 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #44256 [ run ] completed with state FAILURE. Commit: ef935a1
/LLM/main/L0_MergeRequest_PR pipeline #34677 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@ziyixiong-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #44339 [ run ] triggered by Bot. Commit: ef935a1 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #44339 [ run ] completed with state FAILURE. Commit: ef935a1
/LLM/main/L0_MergeRequest_PR pipeline #34756 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Comment thread on tensorrt_llm/_torch/modules/fused_moe/communication/nvlink_one_sided.py (outdated)
- Update copyright year to 2025-2026
- Mark _WORKSPACE and _WORKSPACE_INIT_FAILED as ClassVar
- Move _WORKSPACE_INIT_FAILED check before MnnvlMemory.initialize()
- Release workspace/mnnvl_mem on failure to prevent CUDA memory leak
- Broaden exception handling to match PR NVIDIA#13235 pattern

Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
@ziyixiong-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #44865 [ run ] triggered by Bot. Commit: dd0769e Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #44865 [ run ] completed with state SUCCESS. Commit: dd0769e
/LLM/main/L0_MergeRequest_PR pipeline #35203 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@bobboli bobboli requested review from bobboli and xxi-nv April 22, 2026 08:26
@bobboli
Collaborator

bobboli commented Apr 22, 2026

Please hold on: I find that the phenomenon reproduced by the repair bot differs from the original bug description.

# be broken (detected by NVLinkOneSided workspace init failure). DeepEP also
# relies on NVSHMEM/symmetric memory internally, so it would hang during
# forward pass if the NVLink memory infrastructure is unavailable.
if NVLinkOneSided._WORKSPACE_INIT_FAILED:
Collaborator


Detecting whether NVLink symmetric memory is supported shouldn't be the responsibility of a specific communication backend; relying on one backend's init failure for this is not reliable.
